NSF PAR Search | NSF Public Access Repository

Resource-Efficient Model Adaptation Methods for Personalized Speech Enhancement Systems

Sivaraman, Aswin (May 2024, Indiana University)

The Potential of Neural Speech Synthesis-Based Data Augmentation for Personalized Speech Enhancement

https://doi.org/10.1109/ICASSP49357.2023.10096601

Kuznetsova, Anastasia; Sivaraman, Aswin; Kim, Minje (June 2023, Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing)

Efficient Personalized Speech Enhancement Through Self-Supervised Learning

https://doi.org/10.1109/JSTSP.2022.3181782

Sivaraman, Aswin; Kim, Minje (October 2022, IEEE Journal of Selected Topics in Signal Processing)

Zero-Shot Personalized Speech Enhancement Through Speaker-Informed Model Selection

https://doi.org/10.1109/WASPAA52581.2021.9632752

Sivaraman, Aswin; Kim, Minje (October 2021, 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA))

This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can outperform high-capacity generalist models with greater efficiency and improved adaptation towards unseen test-time speakers.
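The grouping and selection steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes utterance-level embeddings have already been produced by some trained Siamese encoder, it uses a plain NumPy k-means, and it picks the specialist by nearest centroid in Euclidean distance. The function names (`average_speaker_embeddings`, `kmeans`, `select_specialist`) and the distance measure are assumptions made for this example.

```python
import numpy as np

def average_speaker_embeddings(embeddings, speaker_ids):
    """Average utterance-level embeddings into one vector per training speaker."""
    speakers = np.unique(speaker_ids)
    means = np.stack([embeddings[speaker_ids == s].mean(axis=0) for s in speakers])
    return speakers, means

def kmeans(points, k, n_iters=50, seed=0):
    """Plain Lloyd's k-means (hypothetical helper; any k-means routine would do)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Recompute assignments against the final centroids.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    return centroids, dists.argmin(axis=1)

def select_specialist(test_embedding, centroids):
    """Gating step (assumed rule): index of the nearest speaker-group centroid."""
    return int(np.linalg.norm(centroids - test_embedding, axis=1).argmin())
```

In this sketch, each cluster label would index one specialist denoiser trained only on the speakers assigned to that cluster; at test time, only the selected specialist runs, which is what makes the ensemble sparsely active.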
Personalized Speech Enhancement Through Self-Supervised Data Augmentation and Purification

https://doi.org/10.21437/Interspeech.2021-1868

Sivaraman, Aswin; Kim, Sunwoo; Kim, Minje (August 2021, Proceedings of the Interspeech 2021)

Training personalized speech enhancement models is innately a no-shot learning problem due to privacy constraints and limited access to noise-free speech from the target user. If there is an abundance of unlabeled noisy speech from the test-time user, one may train a personalized speech enhancement model using self-supervised learning. One straightforward approach to model personalization is to use the target speaker’s noisy recordings as pseudo-sources. Then, a pseudo denoising model learns to remove injected training noises and recover the pseudo-sources. However, this approach is volatile as it depends on the quality of the pseudo-sources, which may be too noisy. To remedy this, we propose a data purification step that refines the self-supervised approach. We first train an SNR predictor model to estimate the frame-by-frame SNR of the pseudo-sources. Then, we convert the predictor’s estimates into weights that adjust the pseudo-sources’ frame-by-frame contribution towards training the personalized model. We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data in the context of personalized speech enhancement. Our approach may be seen as privacy-preserving as it does not rely on any clean speech recordings or speaker embeddings.
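The data purification idea, turning frame-wise SNR estimates into per-frame training weights, can be illustrated with a short sketch. The abstract does not state how the estimates are converted into weights, so the sigmoid mapping, its `midpoint_db` and `slope` parameters, and the weighted mean-squared-error loss below are all assumptions chosen for illustration, not the paper's method.

```python
import numpy as np

def snr_to_weights(snr_db, midpoint_db=0.0, slope=0.5):
    """Map predicted frame-wise SNR (dB) to weights in (0, 1).

    Assumed mapping: a sigmoid, so cleaner frames receive weights near 1
    and noisier frames receive weights near 0.
    """
    snr_db = np.asarray(snr_db, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint_db)))

def weighted_frame_loss(estimate, pseudo_source, weights):
    """Weighted MSE between the model output and the pseudo-source.

    estimate, pseudo_source: arrays of shape (frames, features).
    weights: array of shape (frames,), e.g. from snr_to_weights.
    """
    per_frame = np.mean((estimate - pseudo_source) ** 2, axis=1)
    return float(np.sum(weights * per_frame) / np.sum(weights))
```

With weights like these, frames of the pseudo-source that the predictor judges to be heavily contaminated contribute little to the loss, so the personalized model is steered mostly by the cleaner frames.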
